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Abstract 

Recently nonparametric functional model with functional responses has 
been proposed within the functional reproducing kernel Hilbert spaces (fRKHS) 
framework. Motivated by its superior performance and also its limitations, we 
propose a Gaussian process model whose posterior mode coincide with the 
fRKHS estimator. The Bayesian approach has several advantages compared 
to its predecessor. Firstly, the multiple unknown parameters can be inferred 
together with the regression function in a unified framework. Secondly, as a 
Bayesian method, the statistical inferences are straightforward through the pos- 
terior distributions. We also use the predictive process models adapted from 
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the spatial statistics literature to overcome the computational limitations, thus 
extending the applicability of this popular technique to a new problem. Mod- 
ifications of predictive process models are nevertheless critical in our context 
to obtain valid inferences. The numerical results presented demonstrate the 
effectiveness of the modifications. 

Keywords: functional reproducing kernel Hilbert spaces, Gaussian predic- 
tive process models, Markov chain Monte Carlo. 



Ramsay and Silverman! ( 120021 ). In 



1 Introduction 

With recent ever-increasing interests and devoted efforts of many researchers, func- 
tional data analysis has developed into a full-fledged subarea of statistics. This surge 
of interests is explained by the ubiquitous examples of functional data that can be 
found in different application fields such as me dicine, economics, physi c s etc. , and 
many interesting applications can be found in 
such applications, use of specialized functional data analytic tools is preferable to 
using multivariate analysis on discretized finite-dimensional vectors, since with the 
former the smoothness property of the curves can be easily handled. With curves as 
the basic units of observation, analysis of functional data often provides interesting 
new statistical perspective and insights into modern data but also poses significant 
theoretical and practical challenges to the statisticians. 

The literature contains a wide range of tools for functional data including ex- 
ploratory functional principal component analysis, canonical correlation analysis, 
classification and regression. Two major approach es exist. The more tra d itiona l 



approach, carefully documented in the monograph 



Ramsay and Silverman! ( 120051 ) 



typically starts by representing functional data by an expansion with respect to a 



certain basis, and subsequent infer ences are carried out on the coefficients. Another 



line of work by the French school ( IFerraty and Vieu 



20061 ). taking a nonparametric 



point of view, extends the traditional nonparametric techniques, most notably the 
kernel estimate, to the functional case. 

In this paper we are concerned with the functional regression problems with the 
extra complication that the dependent variables are al so curves. Functional regressio n 



Ramsay and Silverman! ( 120051 ) 



models with functional responses have been studied in 
where two different linear models are proposed. More specifically, the integral model 
assumes that 

' (1) 



y (t) = a(t)+ p(s,t)x(s)ds + e(t), 
Jo 

where a(t) and f3(s,t) are unknown regression coefficients and e(t) is the mean-zero 
noise. Note that we will assume throughout the paper that all functions are defined 
on the unit interval [0, 1] with trivial extension to functions defined on any bounded 
interval on the real line. On the other hand, the pointwise model assumes that 



y{t) = a(t)+p(t)x(t) + e{t). 



(2) 



Although the parameters reside in infinite-dimensional spaces for both types of lin- 
ear models, in a func tional data conte x t bot h are usually deemed as parametric 



(see Definition 1.4 in 



Ferratv and Vieu 



gression modeling includ e 



Cardot and Sarda 



(12005); 



Cuevas et al. 



( 2006|) ) . Rec e nt wo r ks on parame t ric re - 



(2002 



Hall and Horowitz! (120071 ) ; ICrambes et al. 



( 120091 ): 



James (2002) 



Mul ler and Stadtmullerl (|2005l); iJank and Shmueli 



Cardot et al. 



Crainiceanu and Goldsmith! ( 120101 ) . 



(2003); 
(2006); 
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Nonparametric models for functional data only assumes a general regression form 



Hi = F(xi) + €i,i 



n 



(3) 



for a given independent and identically distributed (i.i.d.) sequence, where both yi 



and %i are functions. At least two approaches have 



first method uses a simple kernel regression estimate (IFerraty and Vieu 



appeared in the 



iterature. The 



20041 ) and the 



second metho d is to use the reproducing kernel Hilbert spaces (RKHS) framework 



(Preda 



20071 ). Both methods were originally proposed for functional models with 



scalar responses. Studies of nonparametric met hods f or fun ctional responses models 



Lianl (120071 ). where both the kernel 



are almost non-existent with the exception of 
method and RKHS method are extended for functional responses. Although it is 
straightforward to extend the kernel regressor to this more complicated situation, 



extension of RKHS requires a novel concept, the so-ca 



Lian 



led functional reproducing 



(120071 1 that the fRKHS estimator 



kernel Hilbert spaces (fRKHS). It was shown in 
performs better than the kernel estimator in functional response models. However, 
there are several limitations with the fRKHS framework: 



1. The algorithm presented in iLianl (120071 ) only computes prediction estimates, 
but performing statistical inferences is difficult. 



2. In 



Lianl (120071 ) the bandwidth parameters involved in the kernel functions are 



chosen based on heuristics leading to suboptimal estimates. 

Although both problems could possibly be tackled using the frequentist approach, 
here we propose a fully Bayesian method to solve these problems. This is made 
possible by a connection between the fRKHS framework and the Gaussian process 
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models. This type of connection is we ll-known in models with scalar responses 



(ICraven and Wahba 



1979 



Wahbal . 



1990) and we will present an extension in our 



context later. Besides, to overcome the computational limitations, we propose to 



adap t the predictive process models in the spatial statistics literature (IBanerjee et al. 



20081 ) and also suggest two different modifications of the predictive process models 
to improve the prediction coverage accuracy. The advantages of working within the 
Bayesian framework are obvious - the unknown parameters and unknown regression 
function itself can be treated in a unified manner, and inferences about arbitrary 
unknowns are straightforward using the posterior distributions. As far as we know 
this is the first Bayesian study on nonparametric functional regression analysis with 
functional responses. Other works in Bayesian fun c tional data analysis, inclu ding 



Thompson and Rosenl (120081 ) : 



Rodriguez et al. 



(|2Q09|); 



Scarpa and Dunsonl (120091 ). do 



not d eal with r e gressi o n problems with a fu nctional predictor. The studies carried 



out in 



Shi et al. 



(120070 : 



Shi and Wang 



(120081 ) are also quite different from ours, where 



the model is basically pointwise in nature and thus the predictors used are in fact 
multivariate instead of functional. 

The rest of the paper is organized as follows. In Section 2 we briefly review the 
fRKHS framework. This section serves only as a motivation for our Bayesian proposal 
and the readers can choose to skip it without seriously hindering understanding of the 
later sections. In Section 3, after pointing out the connection of the fRKHS estimator 
with the Gaussian process models, our fully Bayesian approach is presented with a 
Markov chain Monte Carlo (MCMC) algorithm for posterior inferences. Furthermore, 
the predictive process models and their modifications are proposed as computationally 
more efficient approximations. In Section 4, simulation examples as well as a weather 
data application illustrate the advantages of our approaches. Currently the data size 
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that we can deal with are limited to nT m 5000 where n is the sample size and T is 
the number of time points on which measurements are made on each function, simply 
because first of all we cannot store nT x nT matrices with larger dimensions (R will 
give an error message when attempting to do so). Using bigger computer systems 
can potentially solve this problem but this is outside the scope of the current paper. 
Finally we will conclude with some discussions in Section 5. 



2 Review of functional RKHS 

With scalar reponses y in the nonparametric model ([3]), the well-known RKHS esti- 
mator for the nonparametric regression problem is obtained as the minimizer of the 
regularized loss 

5>,-F(*,)} 2 + A||F||^ (4) 

i 

where if is a RKHS with associated inner product (-,-)h and induced norm || • \ \h- 
By definition, a RKHS H is a Hilbert space of real-valued functions in which the 
point evaluation operator L x : H — >■ R, L x (f) = f(x) is continuous. The Riesz repre- 
sentation theorem then implies the existence of a positive definite kernel K(-, ■) that 
satisfies the reproducing property (K(x, •), f)u = f{%)- Solving the optimization 
problem (j3J) over an infinite-dimensional problem is made feasible by the represen- 
ter theorem. Readers unfa miliar w i th th ese concepts can find more properties and 



discussions about RKHS in 



Wahbal ril990h . 



In regression models with functional responses, we start by assuming that y(-) 
belongs to some RKHS H. For simplicity we assume x(-) also belongs to the same 
space H although this is not necessary. Since F now takes values in H, we need an 
extension of the concept of RKHS. 
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Definition 1 \Lian . \200j) A functional RKHS H is a subset of {F : H — > H}. It is 

a Hilbert space, with inner product (-,-)-h> ^ n which the point evaluation operator is a 
bounded linear operator, i.e., L x : F — )■ F(x) is a bounded operator from % to H for 
any x. 

It can be shown from the definition that there exists a kernel associated with "H, and 
different from the traditional RKHS, this kernel is operator- valued. 

In practice, the observations for yi are made only on a grid < t\ < ti < 
■ ■ ■ < tx < 1. Note we assume that all responses are observed on the same grid 
although different observation time points for different responses can be handled 
without much difficulty (although with much more complicated notations). Let Y = 
(yi(ti), . . . , yi (tr), 2/2(^1), • • • , 2/n(^r)) T be the nT-di mensional ve ctor of responses. Un- 



der some assumptions that simplify the derivation, iLianl ( 120071 ) shows that with the 
minimizer of @ (here the RKHS H that appears in (jlj) should be replaced by the 
fRKHS H)), the fitted values Y can be represented as 



Y = (A ® K)(A ® K + \IY 1 Y, 



(5) 



where A = {a(\\xi — x_j||)}",- =1 , K = {k(\ti — tj\)}fj =l are n x n and T x T matrices 



respectively. 
L 2 norm. In 



Here a(-) a nd k(-) are two positive definite functions and 



is the 



Lianl (120071 ) as well as our current paper, we use the Ga ussian kerne l 
a(t) = exp{—t 2 /p1}, k{t) = exp{—t 2 /pl}. The details can be found in ILianl (120071 ). 



The choice of the kernel bandwidth param eters pi, P2 is critical for the performance 



but difficult to estimate. In 



Lian 



(120071 ). the author uses some simple heuristics 



and takes them to be the average distances between pairs of predictors and pairs of 
time points respectively. This heuristics-based choice produces encouraging numerical 
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results but is in general only suboptimal. 



3 Gaussian process models and approximations 

3.1 Gaussian process models 

In the previous section, the fitted response values with the fRKHS estimator are given 
by (jSJ). Consider now a zero- mean Gaussian process W(x,t) indexed by x G H and 
t G [0, 1] with covariance structure 

EW(x,t)W(x',t') = s 2 a(\\x-x'\\)k{\t-t'\),s 2 > 0. (6) 

We assume a Gaussian process model for the observations 

y i {t j ) = W{x i ,t j ) + e ij , (7) 

where ey ~ N(0, r 2 ) are independent observation noises with variance r 2 , which 
is often called the nugget effect in the spatial statistics literature where the Gaus- 
sian process is usually indexed by locations in R 2 . The Gaussian process model 
is also popular in the machine learning c ommu nity as a technique for nonparamet- 



ric smoothing (IRasmussen and Williams 



20061 ). Here the important difference from 



these works is that the process is indexed by an extra functional covariate, which 
makes it usable for functional data analysis, and unlike what is prevalent in the ma- 
chine learning literature, we take a fully Bayesian approach here. The covariance 
structure (jBJ) is directly motivated by the fRKHS estimator (as indicated in the fol- 
lowing proposition), although the construction of a new kernel as the product of 
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two other kernels is not novel in itself. We also acknowledge the possibility of us- 
ing more general non-separable kernels. However, this problem is quite complicated 
although some recent progresses have been ma de and can be f ound in the statist! 



cal li t erature dealing with spatio-t emporal data ( Mitchell et al. 



2008 



Rodrigues and Diggle 



2005 



Fuentes et al. 



20101 ) . We only consider the separable kernel due to its 



simplicity as well as its connection to the fRKHS estimator. 

The following proposition is an easy consequence of some well-known properties 
of the multivariate Gaussian distribution, and shows the connection between the 
Gaussian process model and the fRKHS estimator. 

Proposition 1 Assume the Gaussian process model stated as in Fty. For any fixed 
x G H,tE [0,1], let 



W(x, t) = E(W(x, t)\Y = (y 1 (t 1 ), . . . , y n (t T )) T , 9), 



which is the minimum variance unbiased linear estimate ofW(x, t) given Y . We have 

W = (A®K)(A®K + XI^Y 

with X = r 2 /s 2 , where W = {W(x u h), W(x 1 ,t 2 ), . . . , W(x x , t T ), ■ ■ ■ , W(x n , t T )) T . 

The result says that the posterior mode (or equivalently posterior mean) of the Gaus- 
sian process model (conditional on Y as well as parameters 8) is exactly the same 
as the fRKHS estimator, when the smoothing parameter A is appropriately chosen. 
It is an extension of existing results for smoothing splines to the fRKHS context. 
Such a result is not trivial however without the concept of fRKHS in the first place. 
This observation suggests that we can estimate functional responses models in a fully 
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Bayesian framework with the likelihood determined by the Gaussian process model 
(j7]), thus harvesting several advantages of Bayesian methods as mentioned in the in- 
troduction. In particular, joint posterior credible intervals would be difficult to obtain 
by other current methods. 

With Bayesian models, we need to assign prior distributions to 6 = (s 2 , r 2 , p 1 , p 2 ). 
Our general principle in choosing prior distributions is to include weakly constraining 
prior information whenever possible. Although it is advisable to use subjective prior 
information if possible, our simulation results show that the data can usually provide 
enough information on these parameters. Details of prior specification are left to the 
next section when we discuss specific simulation examples. 

Inferences in our model proceed by sampling from p(6\Y). We use simple random 
walk Metropolis steps with normal proposals on the logarithmic scale of the parame- 
ters, with proposal variances tuned manually to achieve reasonable acceptance rates. 
Prediction with a new functional covariate x at time t is based on samples from 
p(W(x,t)\8,Y), one for one with each sampled 9. Both steps involve the evaluation 
of Gaussian likelihood and thus the inversion of the covariance matrix for Y, which 
is equal to s 2 A £g> K + r 2 /, a nT x nT matrix. Even if the sample size n and the 
time points T are both small, their product can be big enough to make the inversion, 
which has a complexity that is cubic in nT, a limitation of the approach. 



3.2 Approximation with predictive process models 

To overcome the computational li mitations, predictive p rocess models are proposed to 



deal with large spatial data sets ( iBanerjee et al. 



20081 ). and can be directly adapted 



for our purpose here. Even though the predictive process models are used intensively 
in the spatial statistics literature, we believe this represents their first application in 
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functional response models. Consider sets X* = {xl, . . . , x^} and S* = {£*,...,£*} 
with m < n and q < T, which are usually called the "knots" and may and may not be 
a subset of the original covariates and time points. The predictive process is defined 
as the interpolant of the Gaussian process conditional on the process values at the 
knots, given by W{x,t) = E(W{x, t)\W*), where W* = {W(x, t),x £ X*, t E S*}. W 
itself is a Gaussian process with a degenerate covariance structure, in the sense that 
the joint (multivariate normal) distribution associated with multiple covariates and 
time points is in general singular. More specifically, let A* = { a(| |rc» — 
be the n x m kernel matrix on functional covariates, = {k(\ti — t*j\)}^_^._ l be 
the T x q kernel matrix on time points, A*, and K* s denote their transposes, and 
A** = {a(\\x* -^H)}^!^, K ^ = {k(\t* - tj\)Y i=1 9 =1 , then the covariance matrix 



of Y in the predictive process model is 



s 2 [A^ ® K^}[A^ ® K^] 1 [A* i ® K m ] + t 2 I. 



Determinant as well as inverse of such a matrix can be a ccomplished using Sherman 



Woodbury-Morrison-type comp utations ( 



Harvil 



matrices. It was also shown in iBanerjee et al. 



e, 



19971 ) in terms of only mq x mq 



(120081 ) that the predictive process is 



the best approximation for the parent process in some sense. 
For prediction under the predictive process, IBanerjee et al. 



(1200a ) suggested to 



first sample 9 from the posterior, then sample W* one for one with sampled 9 values 
and finally use the model ((Tj) with W replaced by W to sample the responses. We 
note however that given 9 one can marginalize out w* resulting in a more efficient 
prediction procedure. 
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3.3 Modifications of the predictive process model 

The approximation by the predictive pro cess is shown to only c ause a slight underesti 



mation of the prediction uncertainties in 



Banerjee et al. 



( 120081 ) . The underestimation 



is caused by the more restrictive covariance structure which leads to overconfident in- 
ferences under the posterior distribution. However, in our simulations, the empirical 
coverage of prediction intervals based on the predictive process is severely lower than 
the target value. The reason could be two-fold . Firstly, the sample s ize in our simu- 



fl2008f ). thus the effect 



lations is much smaller than that presented in iBanerjee et al. 
of approximation is more noticeable. Secondly, the nature of the Gaussian process 
indexed by functional covariates is very different from that indexed by spatial loca- 
tions. With the latter, the spatial domain is easily densely covered by a regular grid 
on the plane, while for the former, the covariates present in the data is more sparsely 
scattered in the space of possible covariate curves. A prediction to be made on a 
functional covariate far away from the training data should be very inaccurate what- 
ever estimation method used. However, for the predictive process, once the values of 
the process on the knots are fixed, there is no uncertainty associated with W what- 
soever, leading to severe under- coverage of the prediction intervals. In the following 
we propose two ways to solve this problem, one based on a simple post-processing of 
the predictive variance produced by the predictive process, the other a modification 
of the predictive model itself. These two methods perform similarly in practice. 

We first present the explicit covariance structure to further understand the ap- 
proximation involved in the predictive process. For the rest of this section, we assume 
that the parameters 9 are known or we condition on them. For the parent process, 
given x ts t on which the value of ytstiij) or W(x tst ,tj), 1 < j < T is to be predicted 
(for simplicity we assume that we aim to predict y ts t at the same time points as in 
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the training data), the joint distribution of W(xi,tj),l < i < n,l < j < T and 
W(x tst ,tj), 1 < j < T is given by the multivariate normal distribution 



p({W(x i ,t j )},{W(x tat ,t j )}) = N\0, 



£ ^,tst 
£tst, ^tst,tst 



(8) 



where £ = s 2 A®K, T, itst = s 2 A Mt ®K, £ tstj is its transpose and T, tstttst = s 2 A tstjtst K 
and we recall from the previous subsection that A )tst = (a(\\xi — x tst \\), . . . , a(||x n — 
a^ s t||)), for example. In the predictive process, we effectively replace £ by £ = 
s 2 [A )!)t £g> If ^[A** (g) if**] -1 ^^ <g) 2f*J, and similarly for other blocks in the covariance 
matrix. The predictive distribution for W(zt s t,^-) conditional on Y in the predictive 
process is thus 

q{{W{x tstl t 3 ), 1 < j < T}\Y) = N (£ tst> (£ + r 2 /)- 1 ^ £ tst , tst - £ tst ,(£ + r 2 I)-% tst 

(note that the inverse above is computed by the Sherman- Woodbury-Morrison for- 
mula during implementation). Since a(t) — > as t — > oo, the variance above is small 
if x ts t is far away from all {x*}, and we see mathematically the under- coverage effect. 
This is also counterintuitive: if the test point is far away from all training points, one 
would expect more uncertainties in prediction while the predictive process would be 
(incorrectly) overly confident. One simple solution would be to avoid approximating 
£tst,ts* by £tst,tst i n the lower right block in ([8]), resulting in the predictive distribution 

N (£ fst> (£ + r 2 I)- l Y, £ fst , tst - £^,(£ + T 2 iy l t Mt ^ . 

This is our first proposed modification method. 
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Our second modification consists of replacing all blocks in the covariance matrix 
of (jSJ), for example E, by the more accurate £ + diag(£ — £), where diag(-) is the 
diagonal matrix containing only the diagonal entries of the original matrix. This 
effectively uses £ as in the predictive process, but keep the diagonal entries from 
the parent process. Note that the Sherman- Woodbury-Morrison formula can still be 
applied in this case. 

The first modification proposed is slightly simpler than the second in that the 
only change made to the predictive process is at the final stage of prediction and 
in addition only the predictive variance is modified. However, conceptually, this 
modification actually results in an inconsistent Bayesian model, since the covariance 
structure for the training data and test data are different. The second method requires 
the entire inference procedure be modified, in particular the MCMC algorithm, but 
it is a valid Bayesian model using a special covariance structure. 



3.4 Selection of knots 

Determining optimal knots is a challenging problem. Since our focus here is more on 
the Bayesian model itself as well as the predictive process approximation, we simply 
use a random subs a mple for X* and a regular grid on [0, 1] for S*. For spatial data, 



Stevens and Olsenl (J2004J) showed that a regular grid is more efficient than simple 



random sampling. However, it is unclear how to choose a regular gri d in a function 



space, and thus random sampling is used for X*. It was shown in 



Banerjee et al. 



( 120081 ) that estimation is more sensitive to the number of knots than the design of 
knots locations. 

Finally, a larger number of knots is certainly preferable, but this should be 
weighted against the computation resources available. It is advisable that one repeat 
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the analysis with different numbers of knots, and compare the predictive inferences. 
Of course the range to search also depends on the computational resources. 



4 Numerical Examples 

4.1 Simulation from Gaussian process model 

The data are simulated from the Gaussian process model (J7|). We set the sample size 
to be n = 30 and an equi-spaced grid of T = 40 points on [0, 1]. Each functional 
covariate Xi is generated as a standard Brownian motion multiplied by 5 with a 
random starting position uniform in [0, 5]. Other parameters in the model are set to 
be pi = 20, p 2 = 0.2, s 2 = 2,r 2 = 0.05. A separate test set with sample size 200 is 
used for validation, where we also test on the same regular grid of 40 time points on 
the functional responses. For the prior distribution, we use the weakly informative 



inverse Gamma distributions IG 
t 2 respectively. Since 



Lian 



'2,3) and 1(7(2,0.1) for s 2 and the nugget effect 
(120071 ) demonstrated that setting p\ to be the mean of 
all \\Xi — Xj\\,i,j = l,...,n and similarly setting p2 to be the mean of |i, — tj\ 
produces promising results, we use the data-informed weak prior that is uniform on 
[pi/10, pi x 10] and [P2/IO, p2 x 10] respectively. These priors give sufficient support 
on a wide range of values for the bandwidth parameters. 

The analysis is carried out using the full Gaussian process model, as well as 
predictive process model with m = 30, q = 10 and m = 20, q = 40, together with 
the two modifications proposed. That is, we only consider knots for the functional 
predictors or knots for the time points, but not both. 

Our algorithm is implemented in R. For each simulation, four MCMC chains are 
run for 5000 iterations, and standard convergence diagnostics provided in the coda 
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package show that one can use the first 1000 iterations as the burn-in period. The 
parameter estimates as well as their Bayesian 95% credible intervals are shown in 
Table [TJ The MSE reported is the prediction error of the Gaussian process W on 
the test data based on posterior mean. Also reported is the empirical coverage of 
the 95% prediction intervals as well as their average lengths. Note these intervals 
are pointwise in nature. Using the first modification, the parameter estimates as 
well as the predictive mean are the same as the predictive process model without 
modification, and thus omitted from the table. 

Several observations can be made regarding the results reported in Table [H Firstly, 
setting m = 30, q = 10 produced better estimates than m = 20, q = 40, suggesting 
the functional predictor plays a more important role in estimation. The performance 
of the predictive process model with m = 30, q = 10 is similar to the full model. Sec- 
ondly, there is an unacceptable under-coverage for predictive process models without 
modifications. Thirdly, the predictive process model without modification (or with 
the first kind of modification) tend to overestimate s 2 , which is expected since the 
overly restrictive model requires a larger s 2 to fit the variability seen in the data. 
Finally, in terms of estimation errors, empirical coverage rates and interval lengths, 
the two kinds of modifications produce similar results and there is no clear winner 
between the two. 

In terms of computational time, the full model takes about 20 hours to produce 
the output involving 5000 iterations on our HP workstation xw4400 with Intel Core 
2 Duo E6700 processor and 2GB memory running R in windows XP, while with m = 
30, q = 10, the predictive model takes about 2 hours, with or without modifications. 
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4.2 Simulation from functional regression model 

In this simulation, we return to our initial goal of estimation and prediction in func- 
tional regression models with functional responses. The simulated data are now gen- 
erated from 



with x generated exactly as before and 6y iV(0,r 2 ) with r 2 = 0.2. This model is 
nonlinear in x and a linear functional regression model such as ([1]) would not work 
well. We again set the sample size to be n = 30 and number of time points T = 40, 
and another sample of size 200 is generated as test data. Prior distributions are 
specified the same as in the previous simulation. Simulation results are presented in 



Conclusions similar to those for the Gaussian process simulation can be drawn here 
based on the table and thus not repeated. Note that the Gaussian process model is 
used as a means to estimate the functional regression model, and thus we cannot talk 
about the true values of the parameters. However one can compare the parameter 
estimates of different approximations relatively to those produced by the full model. 
In particular, we see again that random sampling of the functional predictors leads 
to worse performances than choosing a small grid for the time points. Directly using 
the guide values p\ and pi in the fRKHS estimator results in a worse prediction error 
of 1.428, where cross-validation is used to choose the smoothing parameter. 

Next we compare the prediction errors of other competing methods. Table [3] shows 
the prediction errors of different approaches on exactly the same data. Listed in the 
table are the results obtained using integral model (pQ) (LININT), pointwise model 
([2]) (LINPT), as well as the versions when using x 2 (t) in place of x(t) in these lin- 





Table EJ 
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ear models (LININT2 and LI NPT2 respectively ) 



kernel estimator (KERNEL) (IFerraty and Vieu 



Also shown are the nonparametric 



20041 ). as well as an "oracle" version 



of the kernel estimator (OKERNEL) where the noiseless functional responses (that 
is, using the responses from ([9]) without adding noises e^) are used. The smoothing 
parameters in all the linear models (penalizing the second derivatives of the coeffi- 
cients) , as well as the bandwidth parameters for the kernel methods (using Gaussian 
kernel), are obtained by searching over a fine grid and the parameters achieving the 
smallest prediction errors are used, thus giving some advantages to these methods. 
From the table, we see that except for the method LININT2, which performs best 
since the correct linear structure is specified, the nonparametric methods outperform 
the parametric linear methods. Using the simple kernel methods however produce 
much bigger prediction errors compared to using our Bayesian method. 

Since the functional regression model is really what we are interested in here, we 
take this opportunity to provide further insights into the extremely low empirical cov- 
erage rate of the predictive process model without modification. We have argued that 
the lower coverage rate might result from the fact that the predictive process models 
cannot correctly extrapolate to functional predictors far w ay from the majority of the 



covariates seen in the training data. For functional data, 



Lopez-Pintado and Romo 



(120091 ) introduced a novel notion of data depth that quantifies the outlyingness of any 
curve with respect to a reference set of curves. Roughly speaking, a curve that is a 
outlier compared to the reference curves has small depth. We compute the depth of 
the 200 functional covariates in the test data with respect to the functional covariates 
in the training data. For each of the 200 covariates, we also compute separately the 
empirical coverage rate based on the T = 40 time points for each covariates. The 
relationship between the depth and the coverage rate is shown in Fig[TJ A clear posi- 
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tive correlation is seen suggesting that the coverage rate is lower for new data farther 
away from the training data. 

4.3 Illustration with the weather data 

The daily weather data consist of daily temperature and precipitation measurements 
recorded in 35 Canadian weather stations. These are actually averages over the years 
from 1961 to 1994, applying a correction for leap years. The smoothed data are plotted 
in Fig [2] (produced with the sample code provided in the fda package). With n = 35 
and T = 365, R cannot even allocation enough memory to store an nT x nT matrix on 
our workstation. Thus we subsample the data and use only the weekly measurements 
(measurements from every 7th day), and then each observation consists of functional 
data observed on an equi-spaced grid of 53 points. We treat the temperature as the 
independent variable and the goal is to predict the corresponding precipitation curve 
given the temperature measurements. As is previously done, we set the dependent 
variable to be the log transformed precipitation measurements, and a small positive 
number is added to the values with precipitation recorded before taking logarithm. 
The priors are set as in the simulation studies. We also tried several different priors. 
We use priors IG{2, 0.5), IG(2, 3) or IG(2, 10) for s 2 and priors IG(2, 0.02), IG(2, 0.1) 
or IG(2, 1) for r 2 . The results obtained using these priors are almost identical. 

The prediction using the full Gaussian process model is shown in Fig |3] while that 
using prediction process models with the first kind of modification is shown in Fig HJ 
These four stations are left out when training is performed on the rest. With much 
savings in computation for the latter, the predictions made by these two are very 
similar. One visual difference is that in the prediction for the Edmonton station, the 
peak in precipitation produced from the predictive process model is less conspicuous. 
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Using the second kind of modification produces results almost identical to those shown 
in Fig HJ The large error for Pr. Ruppert is due to the unusual wet weather there, 
which can also be visually identified as the outlier in the right panel of Fig [21 

For this dataset, the prediction errors on those four stations are compare to those 
obtained using the integral model ([T|) (Table HJ). The Bayesian method outperforms 
the linear method for three of the four stations, with the most drastic improvement for 
the Pt. Ruppert station. This can be explained by the outlyingness of the responses 
for this station, since with outlying observations, the parametric model is severely 
misspecified while the nonparametric model is more robust with less assumptions 
imposed. The results obtained for the linear pointwise model are much worse and 
thus not shown here. 

5 Conclusion 

We have proposed in this paper a fully Bayesian approach to fitting a nonparametric 
functional regression model with functional responses. The Gaussian process model 
we have proposed is motivated by the theory of fRKHS but can also be constructed 
directly. We show the connection between the two and thus extend the existing knowl- 
edge on their relationship to the case of regression models with functional responses. 
To ease the computational burden, which depends on the product of the sample size 
and the number of time points and thus more of a problem here than functional re- 
gression with scalar responses, we adapt the predictive process models for our purpose 
and also propose two modifications to address the problem of under- coverage for the 
prediction intervals. We note that the under- coverage problem is particularly severe 
here compared to the previous findings in spatial data analysis, and we provided some 
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Table 1: Simulation results for data generated from the Gaussian process model. 





2 

a 




T 1 


PI 




PI 


MSE 


Coverage 


Length 


full model 


2.399 




0.051 


20.96 




0.203 


0.8360 


95.46% 


3.655 




(1.88,3.08) 


(0 


046,0.055) 


(18.6,23.1) 


(0 


191,0.215) 








m = 30, q = 10 




















predictive process 


2.642 




0.051 


21.44 




0.204 


0.8374 


33.35% 


0.476 




(2.03,3.49) 


(0 


046,0.055) 


(19.42,23.57) 


(0 


194,0.215) 








modification 1 
















95.78% 


3.756 


modification 2 


2.233 




0.050 


20.22 




0.205 


0.8356 


95.63% 


3.659 




(1.713,3.005) 


(0 


046,0.054) 


(17.86,23.09) 


(0 


197,0.216) 








m = 20, q = 40 




















predictive process 


4.830 




0.298 


28.78 




0.232 


1.008 


42.91% 


0.750 




(3.39,6.72) 


(0 


273,0.325) 


(27.18,30.21) 


(0 


203,0.252) 








modification 1 
















96.60% 


4.400 


modification 2 


2.621 




0.053 


23.52 




0.205 


1.052 


94.30% 


4.612 




(2.090,3.264) 


(0 


048,0.058) 


(21.26,27.12) 


(0 


194,0.216) 









Table 2: Simulation results for data generated from the functional regression model. 





s 1 


t 1 


PI 




P2 


MSE 


Coverage 


Length 


full model 


9.310 


0.195 


44.82 




0.362 


1.055 


95.63% 


2.868 




(5.15,20.00) 


(0.17,0.21) 


(37.02,60.40) 


(0 


319,0.406) 








m = 30, q = 10 


















predictive process 


9.675 


0.198 


42.59 




0.353 


1.109 


50.64% 


0.596 




(4.85,18.94) 


(0.179,0.219) 


(35.25,55.69) 


(0 


291,0.401) 








modification 1 














94.96% 


2.800 


modification 2 


9.718 


0.196 


42.31 




0.356 


1.116 


95.04% 


2.832 




(5.28,20.93) 


(0.18,0.21) 


(37.32,49.58) 


(0 


319,0.405) 








m = 20. q = 40 


















predictive process 


33.031 


0.282 


86.16 




0.360 


1.146 


40.44% 


0.605 




(11.19,75.69) 


(0.252,0.303) 


(77.86,97.32) 


(0 


336,0.385) 








modification 1 














96.64% 


3.370 


modification 2 


14.974 


0.197 


68.03 




0.334 


1.155 


94.59% 


3.128 




(9.24,24.03) 


(0.175,0.221) 


(57.20,80.87) 


(0 


299,0.372) 









insights into this problem. 

We have tried using a Matern covariance function which is more general than the 
Gaussian covariance function used here, but the simulation on functional regression 
model does not show any improvement. On another direction, one can use Gaussian 



proces s models with a non-zero mean. As indicated in 



ShietaL 



(120071 1: 



Shi and Wang 



( 120081 ). this mean structure can be beneficial when some extra scalar predictors exist. 
We have also tried this but do not see such advantages of modeling mean curve 
together with the covariance structure in our simulation. 



Table 3: Comparison of prediction errors using different functional regression meth- 
ods. 



LININT 


LININT2 


LINPT 


LINPT2 


KERNEL 


OKERNEL 


3.571 


0.376 


3.742 


3.988 


2.070 


1.993 
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Table 4: Prediction errors for the four weather stations using the Gaussian process 
model, the predictive process approximation, and the linear integral model ([Q) . 





Montreal 


Edmonton 


Pr. Ruppert 


Resolute 


Gaussian process 


1.22 


0.54 


18.85 


0.10 


Predictive process 


1.23 


0.65 


19.37 


0.11 


Integral model 


1.10 


0.95 


31.23 


0.18 




Figure 1: Each of the 200 test case is represented as a point showing its depth and 
empirical coverage rate on 40 time points. 
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Figure 2: Daily weather data for 35 Canadian stations. The curves plotted here result 
from using smoothing splines to fit the data. 
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Figure 3: Precipitation data (points), prediction estimate (solid) as well as pointwise 
credible intervals (dashed) using the full Gaussian process model. 
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Figure 4: Precipitation data (points), prediction estimate (solid) as well as pointwise 
credible intervals (dashed) using the predictive process model with the first kind of 
modification. 
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